INTRODUCTION


Classification  and  taxonomy  problems  are  pervasive  in  the  design  of  weapons  systems.  The 
increasing  capabilities  of  sensors  and  signal  processing  computers  require  ever  more  sophisticated 
algorithms.  Two-dimensional  signals  are  becoming  available  in  many  missile  systems.  Infrared  and 
inverse  synthetic  aperture  radar  provide  two-dimensional  signals  for  guidance  and  target  selection  systems. 
One-dimensional  RF  signals  arise  from  range  only  radar  profiles  (active  RF)  and  emitter  recognition 
(passive  RF).  Regardless  of  the  signal  domain  involved,  classification  of  the  unknown  object,  or  objects, 
generating  or  reflecting  the  signal  is  often  the  primary  task  of  the  signal  processing  algorithms. 

The  classification  procedure  assumed  here  consists  of  the  design  and  implementation  of  a 
transformation,  equivalently  a  function  or  mapping,  from  the  set  of  input  patterns  to  a  set  of  desired 
outputs. The resulting transformation is called a classifier. This conforms to the types of models usually
assumed  in  the  pattern  recognition  literature.  The  following  mathematical  definitions  give  simplified 
versions  of  two  basic  pattern  classification  models  (References  1  and  2). 

Model 1. The discrete model.

$U$ is the universe of objects (patterns).

$\Pi$ is a partition $\{U_1, \ldots, U_K\}$ of $U$ into disjoint subsets.

$\mathcal{X}$ is a collection $\{X_1, \ldots, X_K\}$ of finite sets satisfying $X_i \subset U_i$ for $1 \le i \le K$.

$\mathcal{X}$ is known and $\Pi$ is unknown.

Classification consists of the design and implementation of algorithms for determining to which $U_i$ unlabeled objects $u$ belong.

Model 2. The statistical model.

$U$ is the universe of objects (patterns).

$\mathcal{F}$ is a collection $\{f_1, \ldots, f_K\}$ of probability density functions (pdfs) on $U$.

For $1 \le i \le K$, $X_i$ is a finite sample from the pdf $f_i$, and $\mathcal{X}$ is the collection of $X_i$'s.

$\mathcal{X}$ is known and $\mathcal{F}$ is unknown.

Classification consists of the design and implementation of algorithms for determining from which $f_i$ objects $u$ were sampled.

The  geometric  results  presented  in  this  report  are  motivated  primarily  by  Model  1.  Model  2  is  the 
more  appealing  in  many  classification  applications.  However,  there  is  an  enormous  gap  between  the 
known  data  set  X  and  the  unknown  pdfs.  One  must  make  assumptions  regarding  the  types  of  distributions 
which are possible for the $f_i$'s. A good understanding of the underlying physical process usually helps
determine  the  pdfs.  In  the  presence  of  such  understanding,  the  data  set  serves  to  validate  the  hypotheses 
regarding  the  pdfs.  Model  1  assumes  that  the  data  set  X  is  the  only  information  available. 

The  set  of  input  patterns  is  usually  a  real  vector  space,  while  the  final  outputs  consist  of  a  discrete 
set  of  class  labels.  Typically,  the  classifier  employs  intermediate  outputs  which  also  lie  in  a  real  vector 
space.  The  major  analysis  task  is  to  determine  a  single  mapping  that  sends  the  input  patterns  from  distinct 
classes  into  well-separated,  easily  recognizable  regions  of  the  intermediate  output  space.  From  the 
intermediate  output  space,  the  mapping  to  the  class  labels  should  then  be  trivial.  In  the  past  most 
classifiers  were  implemented  on  von  Neumann  computers.  The  advances  of  parallel  processing  technology 
suggest  new  approaches  to  classifier  design. 

Of  the  many  types  of  neurocomputing  devices  currently  discussed  in  the  engineering  literature, 
perhaps  the  simplest  is  the  feed  forward  layered  neural  network  (LNN).  This  network  is  an  obvious 
candidate  for  application  to  classification  problems  where  the  input  patterns  reside  in  a  real  vector  space  of 
fixed  dimension. 

The  LNN  takes  as  input  the  d  coordinates  of  the  pattern  x  and  produces  an  m-dimensional  output 
vector  u.  The  output  vectors  are  selected  so  as  to  facilitate  the  decision  regarding  the  class  to  which  x 
belongs. 

For  a  K-class  problem,  one  typically  takes  the  output  dimension  m  to  be  K  and  the  desired  output 
vectors  to  be  the  K  elementary  unit  vectors 

$$e_i = (0, 0, \ldots, 0, 1, 0, \ldots, 0)^T$$

with 1 in the $i$th coordinate. The network is designed and the weights are determined in order to map every member of the $i$th class into a small neighborhood of $e_i$. These two steps, network design and weight assignment, give rise to interesting problems in the geometry of finite-dimensional Euclidean spaces.

Step  1.  Design  the  network;  that  is,  determine  the  number  of  layers  and  the  number  of  neurons  in 
each  layer. 

Step  2.  Determine  the  weights  (one  for  each  connection  in  the  network)  so  as  to  produce  the 
desired  network  mapping. 

The  two  steps  are  clearly  related.  If  the  network  does  not  accommodate  the  complexity  of  the 
classification  problem,  then  Step  2  will  be  impossible.  Here  we  give  no  precise  definition  of  complexity. 
Roughly  speaking,  the  problem  complexity  grows  with  the  number  of  classes,  the  numbers  of  clusters 
within  the  classes,  and  the  number  of  surfaces  required  to  separate  all  pairs  of  interclass  clusters  (References 
1  and  3).  Appropriate  weights  may  not  exist  if  the  number  of  connections  in  the  network  is  too  small 
(Reference  4). 

We will discuss in some detail the number of first-layer neurons required by a threshold network to separate $K$ convex classes in $d$-dimensional space. It will be shown in Sections 3 and 4 that this number can range from $\lceil \lg(K) \rceil$ to at most $(K^2 - K)/2$ when $K \le d + 2$. For $K > d + 2$, the upper bound is at least

$$(d + 1)K - (d + 1)(d + 2)/2.$$

For $K = 10$ and $d = 8$, this gives a range of 4 to 45 first-layer neurons. For nonconvex classes, there is no upper bound.
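The arithmetic behind the quoted range can be checked with a short sketch. The function name `first_layer_range` is hypothetical, and the lower end is taken to be the ceiling of lg(K), an assumption that agrees with the value 4 quoted for K = 10. For K > d + 2 the text only says the upper bound is "at least" the displayed expression, so the second return value is then that expression and not a proven maximum.

```python
def first_layer_range(K, d):
    """Return (lower, upper) first-layer neuron counts quoted in the text.

    lower: ceil(lg K), computed exactly with integer arithmetic (K >= 1).
    upper: (K^2 - K)/2 when K <= d + 2; otherwise the expression
           (d + 1)K - (d + 1)(d + 2)/2, which the text states the
           upper bound is "at least".
    """
    lower = (K - 1).bit_length()          # equals ceil(lg K) for K >= 1
    if K <= d + 2:
        upper = (K * K - K) // 2          # one hyperplane per pair of classes
    else:
        upper = (d + 1) * K - (d + 1) * (d + 2) // 2
    return lower, upper


if __name__ == "__main__":
    print(first_layer_range(10, 8))
```

At K = d + 2 the two upper expressions agree, which is a quick consistency check on the formulas as read here.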
The number of classes is usually fixed. Although the dimension of the raw input patterns is also
fixed,  the  number  d,  of  inputs  to  the  LNN,  may  depend  upon  preprocessing  and/or  feature  selection.  These 
procedures  generally  reduce  the  input  dimension,  whereas  addition  of  monomial  features  increases  the  input 
dimension  (References  1  and  5). 

The space of mappings $R^{(d)} \to R^{(m)}$ associated with a fixed network architecture is parameterized by the neuron transfer function $s$, called the squashing function, and the weights on the connections. Results presented here will pertain mostly to threshold transfer functions.

Section  2  presents  some  basic  notation  and  terminology.  Properties  of  decompositions  by 
hyperplanes  and  their  relevance  to  networks  of  threshold  neurons  are  discussed  in  Section  3.  Separation 
properties  of  disjoint  convex  sets  are  presented  in  Section  4. 


NOTATION 

$R^{(d)}$ is the space of real $d$-dimensional column vectors

$$y = (y_1, y_2, \ldots, y_d)^T.$$

For every $y \in R^{(d)}$, we define the extended column vector $y^+$ by

$$y^+ = (y_1, y_2, \ldots, y_d, 1)^T.$$

A $\lambda$ by $d + 1$ matrix $A$ defines an affine mapping $y \to A(y)$ from $R^{(d)}$ to $R^{(\lambda)}$ as follows:

$$A(y) = A y^+.$$

Formally, we define the architecture of a LNN to be an ordered triple $(s, h, \Lambda)$, where $s$ is the neuron transfer function, $h$ is a positive integer, and $\Lambda$ is an $(h + 1)$-tuple

$$(\lambda_0, \lambda_1, \ldots, \lambda_h)$$

of positive integers. The function $s$ satisfies

$$-1 \le s(t) \le 1 \text{ for all real } t,$$
$$t < u \Rightarrow s(t) \le s(u) \text{ for all real } t, u,$$
$$\lim[s(t) : t \to -\infty] = -1,$$
$$\lim[s(t) : t \to +\infty] = 1.$$


By a $\Lambda$-network we mean a LNN, perhaps with unspecified neuron transfer function, having layers described by $\Lambda$. Figure 1 shows a (2, 3, 1) network. The extra nodes at layers 0 and 1 have constant value 1 in order to allow affine mappings (with nonzero bias) rather than pure linear mappings.
[Figure 1 legend: one node symbol denotes a node (neuron) with transfer function $s$; the other denotes a node with identity transfer function.]

FIGURE 1. (2, 3, 1) LNN.


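The network mapping $F$ is the composition of $h$ single-layer mappings, each of which is the composition of an affine mapping and a 'squashing' function $S$:

$$F = f_h \circ f_{h-1} \circ \cdots \circ f_2 \circ f_1,$$

where

$$f_h = S \circ A_h,$$

$$S = (s, s, \ldots, s),$$

$$A_h = \begin{pmatrix}
a_h(1, 1) & a_h(1, 2) & \cdots & a_h(1, \lambda_{h-1} + 1) \\
a_h(2, 1) & a_h(2, 2) & \cdots & a_h(2, \lambda_{h-1} + 1) \\
\vdots & \vdots & & \vdots \\
a_h(\lambda_h, 1) & a_h(\lambda_h, 2) & \cdots & a_h(\lambda_h, \lambda_{h-1} + 1)
\end{pmatrix},$$

and the $j$th coordinate of $A_h(x)$ is

$$a_h(j, \lambda_{h-1} + 1) + \sum_{i=1}^{\lambda_{h-1}} a_h(j, i)\, x_i$$

for $1 \le j \le \lambda_h$.

$S$ is a vector of functions each of whose coordinates is $s$. For $y = (y_1, y_2, \ldots, y_\lambda) \in R^{(\lambda)}$,

$$S(y) = (s(y_1), s(y_2), \ldots, s(y_\lambda)).$$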
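The layer-by-layer composition can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the report's implementation: the function names are hypothetical, each weight matrix is a list of rows whose last entry is the bias, and `math.tanh` is used only as one convenient function that is nondecreasing with limits -1 and +1.

```python
import math


def affine(A, y):
    """Apply the affine map A(y) = A y+, where y+ appends a constant 1 to y."""
    y_plus = list(y) + [1.0]
    return [sum(a * v for a, v in zip(row, y_plus)) for row in A]


def layer(A, y, s=math.tanh):
    """One single-layer mapping: squash every coordinate of A(y) with s."""
    return [s(t) for t in affine(A, y)]


def network(weights, x, s=math.tanh):
    """Compose the layers in order: F = f_h o ... o f_2 o f_1."""
    y = list(x)
    for A in weights:
        y = layer(A, y, s)
    return y


if __name__ == "__main__":
    # A (2, 3, 1) network: A1 is 3 by (2 + 1), A2 is 1 by (3 + 1).
    A1 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, -1.0]]
    A2 = [[0.5, -0.5, 1.0, 0.25]]
    print(network([A1, A2], [0.5, -0.5]))
```

Storing the bias as the last column mirrors the extended vector $y^+$, so each layer needs no separate bias term.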
Technically, we have a sequence of functions $S_1, S_2, \ldots$, where

$$S_\lambda : R^{(\lambda)} \to I^{\lambda}$$

and $I$ denotes the unit interval [0, 1]. In order to simplify the notation, we use the unsubscripted $S$ for all $S_\lambda$ when the dimension of the domain is understood. For the (2, 3, 1) network of Figure 1, the network mapping $F$ is given by

$$u = F(x) = S(A_2(S(A_1(x)))),$$

where

$$A_1 = \begin{pmatrix}
a_1(1, 1) & a_1(1, 2) & a_1(1, 3) \\
a_1(2, 1) & a_1(2, 2) & a_1(2, 3) \\
a_1(3, 1) & a_1(3, 2) & a_1(3, 3)
\end{pmatrix}$$

and

$$A_2 = \begin{pmatrix} a_2(1, 1) & a_2(1, 2) & a_2(1, 3) & a_2(1, 4) \end{pmatrix}.$$

For $X \subset R^{(d)}$, Hull($X$) denotes the convex hull of $X$. An extreme point of a convex set $C$ is a point $c$ satisfying

$$E \subset C \text{ and } c \in \mathrm{Hull}(E) \Rightarrow c \in E.$$

That is, $c$ is not a convex combination of other members of $C$.

A convex polytope is the convex hull of a finite set of points. For $X$ finite and $P = \mathrm{Hull}(X)$, the set of vertices $V$ of $P$ is the set of extreme points of $P$. A convex $N$-gon is a convex polytope in $R^{(2)}$ with $N$ vertices. A finite set $X$ in $R^{(d)}$ is the set of vertices of a convex polytope if and only if

$$X \cap \mathrm{Hull}(Y) = Y$$

for all $Y \subset X$.

$\lceil t \rceil$ denotes the smallest integer not smaller than $t$, and $\lfloor t \rfloor$ denotes the largest integer not larger than $t$, for all $t \in R$. For $y \in R^{(d)}$, $\|y\|$ is the Euclidean norm:

$$\|y\| = (y_1^2 + y_2^2 + \cdots + y_d^2)^{1/2}.$$
HYPERPLANES AND THRESHOLD NEURONS

Our objective in this section is to describe, using results from combinatorial geometry, how neuron requirements depend upon the configuration of input patterns. Reference 6 establishes a lower bound on the number of first-layer threshold neurons required, using the formula of Theorem 1 below. We extend this result by showing that the first-layer neuron requirement is problem dependent. That is, the lower bound on the number of first-layer threshold neurons, as a function of the set of data points, can be much larger than that of Reference 6.

Combinatorial analysis of arrangements of hyperplanes in $R^{(d)}$ provides the foundation for threshold LNNs. References 1, 7, and 8 contain good introductions to pattern recognition, combinatorial geometry, and convexity, respectively. We assume here that the reader is familiar with the basics of linear algebra (Reference 9). However, for the reader's convenience, we present some standard definitions and well-known facts.


An affine subspace in $R^{(d)}$ is a translate of a (linear) subspace. A $k$-flat in $R^{(d)}$ is a $k$-dimensional affine subspace. That is, $H$ is a $k$-flat provided $H = U + v$ for some $k$-dimensional subspace $U$ and some $v \in R^{(d)}$.

We adopt the convention that the empty set is the unique $k$-flat for all negative $k$, whereas every singleton $\{y\}$ is a 0-flat.

DEFINITION 1.1. A hyperspace in $R^{(d)}$ is a linear subspace of dimension $d - 1$.

DEFINITION 1.2. A hyperplane in $R^{(d)}$ is a translate of a hyperspace. That is, $H$ is a hyperplane in $R^{(d)}$ provided there exist a hyperspace $U$ and a vector $v$, both in $R^{(d)}$, such that $H = U + v$.

FACT 1. $U$ is a hyperspace in $R^{(d)}$ provided there exists a nonzero vector $a$ in $R^{(d)}$ such that

$$U = \{y \in R^{(d)} : a \cdot y = 0\},$$

where $a \cdot y$ denotes the inner product of $a$ and $y$,

$$a \cdot y = \sum_{i=1}^{d} a_i y_i.$$

FACT 2. $H$ is a hyperplane in $R^{(d)}$ provided there exist a nonzero vector $a$ in $R^{(d)}$ and a scalar $b$ such that

$$H = \{y \in R^{(d)} : a \cdot y - b = 0\}.$$

It will prove helpful to view the set of hyperplanes as characterized both by Definition 1.2 and Fact 2.

Suppose $\mathcal{H}$ is a finite set of hyperplanes in $R^{(d)}$. We say that $\mathcal{H}$ is in general position provided that, for every $e$ with $1 \le e \le d + 1$, the intersection of any $e$-subset of $\mathcal{H}$ has dimension $d - e$.

DEFINITION 2.1. An open half-space $H^{\circ}$ in $R^{(d)}$ is a subset satisfying

$$H^{\circ} = \{y \in R^{(d)} : a \cdot y - b < 0\},$$

where $a \in R^{(d)}$ and $b \in R$. Similarly, a closed half-space $\bar{H}$ in $R^{(d)}$ is a subset satisfying

$$\bar{H} = \{y \in R^{(d)} : a \cdot y - b \ge 0\}.$$

DEFINITION 2.2. A convex polyhedron is the intersection of a finite number of closed half-spaces.

From the preceding definition it follows that a convex polytope is a convex polyhedron. However, a convex polyhedron need not be a polytope, since polytopes must be compact (bounded). A compact convex polyhedron is a convex polytope.

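Now consider a LNN $\mathcal{T} = (\tau, h, \Lambda)$, where $\tau$ is the threshold transfer function satisfying

$$\tau(t) = \begin{cases} -1 & \text{for } t < 0 \\ \phantom{-}1 & \text{for } 0 \le t. \end{cases}$$

For weight matrices $A_1, A_2, \ldots, A_h$, the network mapping $F$ is given by

$$F(x) = f_h(f_{h-1}(\ldots f_2(f_1(x)) \ldots)),$$

$$f_j = T \circ A_j, \text{ for } 1 \le j \le h,$$

$$T = (\tau, \tau, \ldots, \tau)^T.$$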

Let $G(x)$ denote the result of mapping $x$ through only the first layer of neurons, i.e., the second layer of nodes. Equivalently, $G$ equals $f_1$ and is the mapping for the network $(\tau, 1, M)$, where $M = (d, \lambda)$, $d = \lambda_0$, and $\lambda = \lambda_1$. We focus attention on the network $M$ and its mapping $G$ for the following two reasons. First, if $G(x) = G(y)$ for $x, y \in R^{(d)}$, then $F(x) = F(y)$. Thus, input patterns that are to be mapped into different outputs must be separated by $G$. This applies to all LNNs, but is not of great consequence when $s$ is injective (one-to-one). The second reason applies specifically to threshold neurons; namely, the range of the set-valued function

$$G^{-1} : R^{(\lambda)} \to 2^{(R^{(d)})}$$

consists of a finite number of disjoint convex polyhedra in $R^{(d)}$. Therefore, many properties of threshold networks depend largely upon decompositions of $R^{(d)}$ into polyhedra.

For the remainder of this section, we let $A = A_1$, the single $\lambda$ by $d + 1$ weight matrix that defines $G$.

The following theorems and corollaries illuminate the relationship between the weight matrix $A$ and the decomposition of the input space $R^{(d)}$. We let

$$G(x) = (G_1(x), G_2(x), \ldots, G_\lambda(x)),$$

where

$$G_j(x) = \tau(a(j) \cdot x^+),$$

and $a(j)$ is the $j$th row of $A$. Recalling that the rows $a(j)$ of $A$ are $(d + 1)$-vectors, we denote by $a'(j)$ the truncated $d$-dimensional row vector

$$a'(j) = (a(j, 1), a(j, 2), \ldots, a(j, d)).$$

In order to avoid discussing degenerate configurations, we assume that the truncated row vectors $a'(j)$, $1 \le j \le \lambda$, lie in general position in $R^{(d)}$. That is, for $1 \le e \le d$, every $e$-subset of the $a'(j)$s is linearly independent.

LEMMA 1. $G_j^{-1}(-1)$ and $G_j^{-1}(1)$ are complementary open and closed half-spaces in $R^{(d)}$, respectively, for $1 \le j \le \lambda$.

LEMMA 2. For $\sigma \in \{-1, 1\}^{\lambda}$, the closure of $G^{-1}(\sigma)$ is a convex polyhedron in $R^{(d)}$.

PROOF. $G^{-1}(\sigma) = \bigcap_{j=1}^{\lambda} G_j^{-1}(\sigma_j)$. Since $G^{-1}(\sigma)$ is the intersection of finitely many half-spaces, its closure is a convex polyhedron. (We consider the empty set to be a convex polyhedron.)
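The first-layer map and its cells can be illustrated with a small sketch. The names `threshold` and `first_layer` and the two example lines are hypothetical choices for illustration; each row of the weight matrix holds the d coefficients followed by the bias.

```python
def threshold(t):
    """Threshold transfer function: -1 for t < 0, +1 for t >= 0."""
    return 1 if t >= 0 else -1


def first_layer(A, x):
    """First-layer map G(x) = (tau(a(1).x+), ..., tau(a(lambda).x+)).

    A is a list of rows of length d + 1; the last entry of each row is
    the bias, matching the extended vector x+ = (x_1, ..., x_d, 1).
    """
    x_plus = list(x) + [1.0]
    return tuple(threshold(sum(a * v for a, v in zip(row, x_plus))) for row in A)


if __name__ == "__main__":
    # Two hypothetical lines in the plane: x = 0.5 and y = 0.5.
    A = [[1.0, 0.0, -0.5], [0.0, 1.0, -0.5]]
    for p in [(0, 0), (1, 0), (1, 1), (0, 1), (0.2, 0.1)]:
        print(p, first_layer(A, p))
```

Points that receive the same sign vector lie in the same polyhedral cell, so no later layer can map them to different outputs.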

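THEOREM 1 (References 6 and 7). Let $\mathcal{H}$ be a set of $\lambda$ hyperplanes in general position in $R^{(d)}$. Then $R^{(d)} - \bigcup\mathcal{H}$ is the union of $\mathrm{Reg}_d(\lambda)$ connected components, each of which is a convex polyhedron, where

$$\mathrm{Reg}_d(\lambda) = \binom{\lambda}{0} + \binom{\lambda}{1} + \cdots + \binom{\lambda}{d}.$$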
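The region count of Theorem 1 is a partial sum of binomial coefficients and is easy to evaluate. The sketch below uses the hypothetical name `reg` and the standard library function `math.comb`.

```python
from math import comb


def reg(d, lam):
    """Number of regions cut out of R^(d) by lam hyperplanes in general position."""
    return sum(comb(lam, k) for k in range(d + 1))


if __name__ == "__main__":
    for lam in range(6):
        print(lam, reg(2, lam))
```

When the number of hyperplanes does not exceed d, every sign pattern occurs and the count is a power of two; for example, `reg(3, 3)` is 8.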


DEFINITION 3. For $X \subset R^{(d)}$, we say that a partition $\{X_1, X_2\}$ of $X$ into two disjoint subsets is linearly separable if there exists a hyperplane $H$ that separates every pair of points $(x_1, x_2)$ with $x_1 \in X_1$, $x_2 \in X_2$.

FACT 3. $\{X_1, X_2\}$ is linearly separable provided there exist a vector $a$ and a scalar $b$ such that

$$a \cdot x - b \begin{cases} < 0 & \text{if } x \in X_1 \\ > 0 & \text{if } x \in X_2. \end{cases}$$

DEFINITION 4. Suppose that $\mathcal{X} = \{X_1, X_2, \ldots, X_K\}$ is a finite family of subsets of $R^{(d)}$ and $\mathcal{H} = \{H_1, H_2, \ldots, H_\lambda\}$ is a family of hyperplanes in $R^{(d)}$. We say that $\mathcal{H}$ separates $\mathcal{X}$ if for every $x_i \in X_i$, $x_j \in X_j$, $1 \le i < j \le K$, there is at least one member of $\mathcal{H}$ that separates $x_i$ and $x_j$.

FACT 4. A finite set $\mathcal{H}$ of hyperplanes separates a finite family $\mathcal{X}$ of sets in $R^{(d)}$ if and only if every connected component of $R^{(d)} - \bigcup\mathcal{H}$ contains members of at most one member of $\mathcal{X}$ and $\bigcup\mathcal{H}$ is disjoint from $\bigcup\mathcal{X}$.
THEOREM 2 (Reference 10). Let $X$ be a set of $N$ points in general position in $R^{(d)}$. Then the number of linearly separable partitions of $X$ into two disjoint subsets is $\mathrm{Sep}_d(N)$, where

$$\mathrm{Sep}_d(N) = \binom{N-1}{0} + \binom{N-1}{1} + \cdots + \binom{N-1}{d}.$$
REMARKS. Theorems 1 and 2 are 'almost dual' to one another. By moving from $R^{(d)}$ to projective space $P^{(d)}$ with appropriately modified definitions, lines and planes may be interchanged by projective duality. The discrepancy between the formulas for $\mathrm{Reg}_d(\lambda)$ and $\mathrm{Sep}_d(N)$ results from the different topologies of $R^{(d)}$ and $P^{(d)}$. This difference is exemplified by the fact that a projective hyperplane does not disconnect $P^{(d)}$, while two projective hyperplanes decompose $P^{(d)}$ into two disjoint components. Of the $\mathrm{Reg}_d(\lambda)$ components in $R^{(d)}$ determined by $\lambda$ hyperplanes, $2\,\mathrm{Reg}_{d-1}(\lambda - 1)$ of them are infinite. These infinite regions occur in $\mathrm{Reg}_{d-1}(\lambda - 1)$ pairs that are connected when transformed into projective space $P^{(d)}$. Thus, the number of connected components determined by $\lambda$ hyperplanes in $P^{(d)}$ is, in fact, $\mathrm{Sep}_d(\lambda)$, as one would expect from duality.
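The counting argument in these remarks amounts to the identity $\mathrm{Reg}_d(\lambda) - \mathrm{Reg}_{d-1}(\lambda - 1) = \mathrm{Sep}_d(\lambda)$: merging each pair of infinite regions removes one region per pair. The sketch below checks this numerically over a small range; the function names are hypothetical and the chosen range is arbitrary.

```python
from math import comb


def reg(d, lam):
    """Regions of R^(d) determined by lam hyperplanes in general position."""
    return sum(comb(lam, k) for k in range(d + 1))


def sep(d, n):
    """Linearly separable two-block partitions of n points in general position."""
    return sum(comb(n - 1, k) for k in range(d + 1))


def infinite_regions(d, lam):
    """Count of unbounded regions, 2 * Reg_{d-1}(lam - 1), as stated in the remarks."""
    return 2 * reg(d - 1, lam - 1)


if __name__ == "__main__":
    for d in range(1, 4):
        for lam in range(1, 7):
            merged = reg(d, lam) - infinite_regions(d, lam) // 2
            print(d, lam, reg(d, lam), merged, sep(d, lam))
```

For d = 2 and four lines this gives 11 regions, 8 of them infinite, and 7 components after the infinite regions are paired.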

EXAMPLE 1. For $d = 2$ and $\lambda = 4$, we have

$$\mathrm{Reg}_2(4) = \binom{4}{0} + \binom{4}{1} + \binom{4}{2} = 1 + 4 + 6 = 11.$$

Figure 2 shows four lines in the plane and the resulting decomposition into 11 regions: three finite and eight infinite.

FIGURE 2. Eleven Regions in $R^{(2)}$ Determined by Four Lines.
EXAMPLE 2. Figure 3(a) shows a set $Y = \{y_1, y_2, y_3, y_4\}$ of four points in $R^{(2)}$. Table 1(a) shows the seven linear separations of $Y$.

[Figure 3: panel (a) shows the points $y_1, y_2, y_3, y_4$; panel (b) shows the points $z_1, z_2, z_3, z_4$.]

FIGURE 3. Quadruples Y and Z in $R^{(2)}$.

TABLE 1(a). Separable Partitions of Y in Figure 3(a).

$Y_1$              $Y_2$
$\emptyset$        $\{y_1, y_2, y_3, y_4\}$
$\{y_1\}$          $\{y_2, y_3, y_4\}$
$\{y_2\}$          $\{y_1, y_3, y_4\}$
$\{y_3\}$          $\{y_1, y_2, y_4\}$
$\{y_1, y_2\}$     $\{y_3, y_4\}$
$\{y_1, y_4\}$     $\{y_2, y_3\}$
$\{y_1, y_3\}$     $\{y_2, y_4\}$

EXAMPLE 3. Figure 3(b) shows a set $Z = \{z_1, z_2, z_3, z_4\}$ of four points in $R^{(2)}$. Table 1(b) shows the seven linear separations of $Z$.
TABLE 1(b). Separable Partitions of Z in Figure 3(b).

$Z_1$              $Z_2$
$\emptyset$        $\{z_1, z_2, z_3, z_4\}$
$\{z_1\}$          $\{z_2, z_3, z_4\}$
$\{z_2\}$          $\{z_1, z_3, z_4\}$
$\{z_3\}$          $\{z_1, z_2, z_4\}$
$\{z_4\}$          $\{z_1, z_2, z_3\}$
$\{z_1, z_4\}$     $\{z_2, z_3\}$
$\{z_1, z_2\}$     $\{z_3, z_4\}$

REMARKS. The 4-sets of Examples 2 and 3 both admit seven linearly separable partitions, as indicated by Theorem 2. However, the sets of partitions are not isomorphic. That is, there exists no mapping from $Y$ to $Z$ that sends the linearly separable partitions of $Y$ into those of $Z$. This stems from the fact that $Y$ and $Z$ represent different order types in $R^{(2)}$. Reference 11 presents definitions and basic results on order types in Euclidean spaces.

The fact that the partition

$$\{z_1, z_3\},\ \{z_2, z_4\}$$

is not linearly separable is what prevents one from 'solving' the exclusive-or problem with $\lambda = 1$ (Reference 12).
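The exclusive-or obstruction can be illustrated by enumerating lines over a small grid of coefficients. The coordinates below are an assumption (the corners of the unit square, with $z_1$ and $z_3$ on one diagonal), and the grid search is only an illustration: it finds every partition realized by some line in the grid, but it is not by itself a proof that other partitions are impossible. The impossibility of the diagonal partition follows from $z_1 + z_3 = z_2 + z_4$, since then $(a \cdot z_1 - b) + (a \cdot z_3 - b) = (a \cdot z_2 - b) + (a \cdot z_4 - b)$, and the two sides cannot have opposite signs.

```python
from itertools import product

# Hypothetical coordinates: unit-square corners, z1 and z3 on one diagonal.
Z = {"z1": (0, 0), "z2": (1, 0), "z3": (1, 1), "z4": (0, 1)}


def partitions_on_grid(points, coeffs=range(-2, 3)):
    """Collect the unordered partitions realized by lines a.x - b = 0.

    a ranges over nonzero integer pairs from coeffs and b over half-integers,
    so a.x - b is never zero at a point with integer coordinates.
    """
    found = set()
    for a1, a2, k in product(coeffs, coeffs, range(-3, 3)):
        if a1 == 0 and a2 == 0:
            continue  # a must be a nonzero vector
        b = k + 0.5
        neg = frozenset(n for n, (x, y) in points.items() if a1 * x + a2 * y - b < 0)
        pos = frozenset(points) - neg
        found.add(frozenset({neg, pos}))
    return found


if __name__ == "__main__":
    parts = partitions_on_grid(Z)
    xor = frozenset({frozenset({"z1", "z3"}), frozenset({"z2", "z4"})})
    print(len(parts), xor in parts)
```

Under these assumed coordinates the search recovers the partitions listed in Table 1(b) and never produces the diagonal split.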

The planar exclusive-or problem leads us naturally into pattern recognition in Euclidean spaces. We adopt the following simple model. We are given a family

$$\mathcal{X} = \{X_1, X_2, \ldots, X_K\}$$

of $K$ disjoint finite subsets of $R^{(d)}$, with

$$X = X_1 \cup X_2 \cup \cdots \cup X_K,$$
$$|X_j| = n_j, \quad 1 \le j \le K,$$
$$|X| = N = n_1 + n_2 + \cdots + n_K.$$

This corresponds to a $K$-class problem for which $X_i$ is the training sample for Class $i$. The task is to define a neural network (or some other type of classification device) whose mapping $F$ satisfies

$$\|F(x_i) - u_i\| \le \epsilon, \text{ for all } x_i \in X_i. \qquad (1.1)$$

These conditions force $F$ to map $X_i$ into a sphere of radius $\epsilon$ centered at $u_i$. The $u_i$ are distinct points in $R^{(m)}$, the output space, and $\epsilon$ is some small allowable error; in particular,

$$2\epsilon < \|u_i - u_j\| \text{ for } 1 \le i < j \le K, \qquad (1.2)$$

so that the $K$ target spheres in $R^{(m)}$ are disjoint.
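As a worked instance of condition (1.2), assume the targets are the elementary unit vectors $u_i = e_i$ mentioned in the Introduction. This is an illustrative choice only; the model above allows any distinct targets. Two distinct unit vectors differ in exactly two coordinates, which gives the admissible error directly:

```latex
\|e_i - e_j\| = \sqrt{1^2 + (-1)^2} = \sqrt{2} \quad (i \ne j),
\qquad\text{so (1.2) requires}\qquad
2\epsilon < \sqrt{2} \;\Longleftrightarrow\; \epsilon < \tfrac{1}{\sqrt{2}} \approx 0.707 .
```

Under this assumption the bound on $\epsilon$ does not depend on the number of classes $K$.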

Our results on neuron requirements pertain only to the number $\lambda = \lambda_1$ of first-layer threshold neurons required in a threshold network for the network mapping $F$ to achieve the classification objective. We appeal to the following obvious fact.

FACT 5. If the network mapping $F$ of a threshold network satisfies Equations 1.1 and 1.2, then for all $x_i \in X_i$ and $x_j \in X_j$, $1 \le i < j \le K$:

$$G(x_i) \ne G(x_j).$$

Lemma 3 follows immediately.

LEMMA 3. If there exist weights for a threshold network $\mathcal{T} = (\tau, h, \Lambda)$ such that the resulting $F$ satisfies Equations 1.1 and 1.2, then the set of hyperplanes determined in $R^{(d)}$ by the $\lambda_1$ first-layer neurons separates $\mathcal{X}$.

In order to relate threshold neuron requirements to sets of training data, we introduce six combinatorial functions. In the following definition, point sets and sets of hyperplanes are assumed to be in general position in $R^{(d)}$.

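DEFINITIONS. For $X$ a finite subset of $R^{(d)}$, and $\mathcal{X} = \{X_1, X_2, \ldots, X_K\}$ a partition of $X$ into $K$ disjoint subsets:

$$\lambda_{\min}(X/\mathcal{X}) = \min\{\lambda : \text{there exists a set } \mathcal{H} \text{ of } \lambda \text{ hyperplanes which separates } \mathcal{X}\}. \qquad (5.1)$$

For a partition $\mathcal{N} = \{n_1, n_2, \ldots, n_K\}$ of $N$ into $K$ positive integers,

$$\lambda_{\min\min}(d, N, \mathcal{N}) = \min\{\lambda_{\min}(X/\mathcal{X})\} \qquad (5.2)$$

$$\lambda_{\max\min}(d, N, \mathcal{N}) = \max\{\lambda_{\min}(X/\mathcal{X})\} \qquad (5.3)$$

where the minimum in Equation 5.2 and the maximum in Equation 5.3 are taken over all $X$, $\mathcal{X}$ such that $X$ is an $N$-subset of $R^{(d)}$ and $\mathcal{X}$ is a partition of $X$ into $K$ disjoint subsets with cardinalities $n_i$, $1 \le i \le K$.

$$\lambda_{\min}(X) = \lambda_{\min}(X/\mathcal{S}), \qquad (5.4)$$

where $\mathcal{S}$ denotes the family of singletons $\{\{x\} : x \in X\}$.

$$\lambda_{\min\min}(d, N) = \min\{\lambda_{\min}(X) : X \subset R^{(d)} \text{ and } |X| = N\} \qquad (5.5)$$

$$\lambda_{\max\min}(d, N) = \max\{\lambda_{\min}(X) : X \subset R^{(d)} \text{ and } |X| = N\}. \qquad (5.6)$$

For a $K$-class training set $X$, with

$$\mathcal{X} = \{X_1, X_2, \ldots, X_K\},$$

the number of first-layer threshold neurons required is at least $\lambda_{\min}(X/\mathcal{X})$. If each $X_i$ contains a single prototype, then at least $\lambda_{\min}(X)$ first-layer threshold neurons are required.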




NWC  TP  7106 


LEMMA 4 (Reference 6). If $X$ is an $N$-subset of $R^{(d)}$, then

$$\lambda_{\min}(X) \ge \min\{\lambda : \mathrm{Reg}_d(\lambda) \ge N\}.$$

PROOF. Suppose $\mathrm{Reg}_d(p) < N$. A set $\mathcal{H}$ of $p$ hyperplanes in $R^{(d)}$ decomposes $R^{(d)}$ into fewer than $N$ regions, so at least two of the $N$ members of $X$ must lie in the same region. Thus, $X$ is not separated by $\mathcal{H}$. The conclusion of the lemma follows immediately.

Lemma 4 gives a lower bound for $\lambda_{\min}(X)$ in terms of $|X|$. Thus, we have a lower bound for $\lambda_{\min\min}(d, N)$. This lower bound is, in fact, sharp.

THEOREM 3.

$$\lambda_{\min\min}(d, N) = \min\{\lambda : \mathrm{Reg}_d(\lambda) \ge N\}.$$

PROOF. Let $\mathcal{H}$ be a set of $p$ hyperplanes in $R^{(d)}$, where

$$p = \min\{\lambda : \mathrm{Reg}_d(\lambda) \ge N\}.$$

$\mathcal{H}$ decomposes $R^{(d)}$ into $r = \mathrm{Reg}_d(p)$ disjoint regions. Select a point from each of the regions, and let $X$ be an $N$-subset of the selected $r$-set. This is possible because $r \ge N$. $\mathcal{H}$ separates $X$, so $\lambda_{\min}(X) \le p$. But $\lambda_{\min}(X) \ge p$ by Lemma 4. Thus, $\lambda_{\min}(X) = p$, and the theorem follows.
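Theorem 3 reduces the sharp lower bound to a search for the smallest number of hyperplanes whose region count reaches N. A minimal sketch, with hypothetical function names and a simple linear search:

```python
from math import comb


def reg(d, lam):
    """Regions of R^(d) determined by lam hyperplanes in general position."""
    return sum(comb(lam, k) for k in range(d + 1))


def lambda_minmin(d, n):
    """Smallest lam with reg(d, lam) >= n, the bound of Lemma 4 and Theorem 3."""
    if d < 1 or n < 1:
        raise ValueError("d and n must be positive integers")
    lam = 0
    while reg(d, lam) < n:
        lam += 1
    return lam


if __name__ == "__main__":
    print(lambda_minmin(2, 11))
```

A linear search is adequate here because the region count grows at least linearly in the number of hyperplanes once d is at least 1.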

EXAMPLE 4. Figure 4(a) shows a set $W_1$ of 11 point-classes (prototypes) for which $\lambda_{\min}(W_1) = 4$. Since $\mathrm{Reg}_2(3) = 7 < 11 = \mathrm{Reg}_2(4)$, no set of 11 points in $R^{(2)}$ can be separated by fewer than four lines. Figure 4(b) shows a set $W_2$ of 11 point-classes for which $\lambda_{\min}(W_2) = 6$.

Lemma 4 and Theorem 3 determine $\lambda_{\min\min}(d, N)$ exactly. Clearly, $\lambda_{\min\min}(d, N)$ is also the lower bound on the number of threshold neurons required to separate any $N$-set $X$ in $R^{(d)}$. As $X$ varies through the $N$-subsets of $R^{(d)}$, $\lambda_{\min}(X)$ is bounded above by $\lambda_{\max\min}(d, N)$. The set $W_2$ in Figure 4(b) shows that $\lambda_{\max\min}(2, 11) \ge 6$. This is a special case of the following theorem.

FIGURE 4(b). Eleven Points in $R^{(2)}$ Requiring Six Lines for Separation.

THEOREM 4.

$$\lambda_{\max\min}(2, N) \ge \lceil N/2 \rceil.$$

PROOF. Let $X$ be the set of $N$ vertices of a convex $N$-gon $C$. Let $B$ denote the boundary of $C$. $B$ is a simple, closed polygonal curve containing $N$ vertices, the members of $X$, and $N$ edges, the line segments joining consecutive members of $X$. Every line in $R^{(2)}$ intersects $B$ in at most two points. Since $C$ is convex, every line in $R^{(2)}$ intersects at most two of the edges in $B$. Let $\mathcal{H}$ be a set of $\lambda$ lines in $R^{(2)}$ that separate $X$. Since each member of $\mathcal{H}$ meets at most two edges in $B$, $\bigcup\mathcal{H}$ meets at most $2\lambda$ edges of $B$. But every edge must be cut by at least one line, since $X$ is separated. It follows that $2\lambda \ge N$ and $\lambda \ge \lceil N/2 \rceil$.

The $N$-gon in $R^{(2)}$ provides an $N$-set that is difficult to separate, with difficulty measured by the number of lines required. This is a special case of $N$-sets lying on the moment curve in $R^{(d)}$ (Reference 7).

DEFINITION 6. The moment curve in $R^{(d)}$ is the set defined by

$$M^{(d)} = \{(t, t^2, t^3, \ldots, t^d) : t \in R\}.$$

Finite subsets of $M^{(d)}$ provide interesting examples in the study of convex polytopes. If $X$ is a finite subset of $M^{(d)}$, then every point of $X$ is an extreme point of Hull($X$). Furthermore, $X$ is difficult to separate, which is the property of interest here.

LEMMA 5. A hyperplane in $R^{(d)}$ cuts $M^{(d)}$ in at most $d$ points.

PROOF. Let $H$ be a hyperplane in $R^{(d)}$. From Fact 2, there exist a nonzero vector $a$ and a scalar $b$ such that

$$H = \{y \in R^{(d)} : a \cdot y - b = 0\}.$$

Suppose $y_i \in H \cap M^{(d)}$. Since $y_i \in M^{(d)}$, there exists $t_i \in R$ such that $y_i = (t_i, t_i^2, \ldots, t_i^d)$. Since $y_i \in H$, we have

$$a \cdot y_i - b = a_1 t_i + a_2 t_i^2 + \cdots + a_d t_i^d - b = 0.$$

Thus, every $y_i \in H \cap M^{(d)}$ corresponds to a root of the polynomial

$$f_H(t) = a_1 t + a_2 t^2 + \cdots + a_d t^d - b.$$

The lemma follows from the fact that $f_H(t)$, a nonzero polynomial of degree at most $d$, has at most $d$ roots.

LEMMA 6. Suppose $X \subset M^{(d)}$ and $|X| = N$. Then $\lambda_{\min}(X) \ge \lceil (N - 1)/d \rceil$.

PROOF. We may assume the members $x_i$ of $X$ satisfy

$$x_i = (t_i, t_i^2, \ldots, t_i^d),$$

where

$$t_1 < t_2 < \cdots < t_N.$$

Let $M_i$ be the open interval in $M^{(d)}$ connecting $x_i$ and $x_{i+1}$ for $1 \le i \le N - 1$. That is,

$$M_i = \{(t, t^2, \ldots, t^d) : t_i < t < t_{i+1}\}.$$

The ordering of the $t_i$s guarantees that the $M_i$ are disjoint. Now suppose that $\mathcal{H}$ is a family of $\lambda$ hyperplanes that separates $X$. From Lemma 5 we know that each member $H_j$, $1 \le j \le \lambda$, of $\mathcal{H}$ cuts $M^{(d)}$ in at most $d$ points. In order for $X$ to be separated, each of the $N - 1$ segments $M_i$ must be cut by a member of $\mathcal{H}$. It follows that

$$\lambda d \ge N - 1$$

and

$$\lambda \ge \lceil (N - 1)/d \rceil.$$

THEOREM 5.

$$\lambda_{\max\min}(d, N) \ge L_{\max\min}(d, N),$$

where

$$L_{\max\min}(d, N) = \lceil (N - 1)/d \rceil.$$

PROOF. This inequality follows directly from Lemma 6 and the definition of $\lambda_{\max\min}(d, N)$.


THEOREM 6. Discrete Ham Sandwich Theorem (Reference 7).

Suppose we have $d$ sets $X_1, X_2, \ldots, X_d$ in $R^{(d)}$, with $|X_i| = n_i$. Then there exists a hyperplane $H$ such that $H$ bisects every $X_i$. That is, for $1 \le i \le d$, $|X_i \cap H^+|$ and $|X_i \cap H^-|$ differ by at most 1. For $n_i$ odd the difference is 1, and for $n_i$ even the difference is 0.

PROOF. See the Appendix.

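DEFINITION 7. For positive integers $d$, $N$,

$$r = 1 + \lfloor \lg(d) \rfloor$$

and

$$U_{\max\min}(d, N) = \begin{cases} \lceil \lg(N) \rceil & \text{if } N \le d + 1 \\ r + \lceil (N - 2^r)/d \rceil & \text{if } N \ge d + 2. \end{cases}$$

Theorem 6 enables us to bound $\lambda_{\max\min}(d, N)$ above. The upper bound, $U_{\max\min}(d, N)$, is established by an algorithm. The idea is best seen by working through an example.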

EXAMPLE 5. Let $X$ be a set of 45 points in $R^{(4)}$. We invoke Theorem 6 repeatedly to define a sequence $H_1, H_2, \ldots, H_{13}$ of hyperplanes that separate $X$. Initially, the best we can do is select $H_1$ to bisect $X$. This gives $X_1^{(1)}$, $X_2^{(1)}$ of cardinalities 23 and 22. Next, we select $H_2$ to simultaneously bisect $X_1^{(1)}$ and $X_2^{(1)}$, giving subsets $X_1^{(2)}, X_2^{(2)}, X_3^{(2)}, X_4^{(2)}$ of cardinalities 12, 11, 11, 11. Next, we select $H_3$ to simultaneously bisect all four $X_j^{(2)}$'s. This gives eight subsets $X_j^{(3)}$ of cardinalities 6, 6, 6, 5, 6, 5, 6, 5. At each remaining step we can bisect four of the existing components. We select the largest four at each step.

Thus, at step 4 we bisect four of the five $X_j^{(3)}$'s of cardinality 6. This gives twelve components of cardinalities

3, 3, 3, 3, 3, 3, 5, 3, 3, 5, 6, 5.

The next four to be bisected have cardinalities 5, 5, 6, 5. The process yields 45 components after nine more steps, since the number of components increases by four at each step. Therefore,

$$\lambda_{\max\min}(4, 45) \le 13 = U_{\max\min}(4, 45).$$
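The bookkeeping in this example can be replayed by tracking only component sizes. The sketch below is an illustration under stated assumptions: it never constructs hyperplanes, it simply assumes (as Theorem 6 guarantees) that up to d components can be bisected at once, and it always picks the largest ones. The closed form in `u_maxmin` is the count r + s with s the smallest integer satisfying $2^r + ds \ge N$; both function names are hypothetical.

```python
import heapq


def simulate_separation(d, n):
    """Count bisection steps until every component holds a single point."""
    sizes = [-n]                      # max-heap of component sizes (negated)
    steps = 0
    while -sizes[0] > 1:
        chosen = []
        # Take up to d of the largest components that still need cutting.
        while sizes and len(chosen) < d and -sizes[0] > 1:
            chosen.append(-heapq.heappop(sizes))
        for c in chosen:
            heapq.heappush(sizes, -(c // 2))       # smaller half
            heapq.heappush(sizes, -(c - c // 2))   # larger half
        steps += 1
    return steps


def u_maxmin(d, n):
    """Closed-form count: ceil(lg n) if n <= d + 1, else r + ceil((n - 2^r)/d)."""
    if n <= d + 1:
        return (n - 1).bit_length()   # ceil(lg n) for n >= 1
    r = d.bit_length()                # 1 + floor(lg d) for d >= 1
    return r + -(-(n - 2 ** r) // d)  # integer ceiling division


if __name__ == "__main__":
    print(simulate_separation(4, 45), u_maxmin(4, 45))
```

The simulation ignores any extra components a hyperplane may create by accident, so it mirrors the conservative counting used in the example.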
Note that at each step we may have created more components than those guaranteed by Theorem 6. We ignore this possibility and continue bisecting subsets as if they had not been cut by any earlier hyperplanes. The general algorithm follows.

SEPARATION  ALGORITHM. 

INPUT: An N-subset X of $R^{(d)}$.

OUTPUT: A set $\mathcal{H} = \{H_1, H_2, \ldots, H_U\}$ of U hyperplanes that separates X.

For r as defined above,

$$2^{r-1} \le d,$$

$$2^r > d.$$

If $N \le d$, then each application of Theorem 6, except perhaps the last, doubles the number of components.
Thus, we obtain a separating set of size $\lceil \lg(N) \rceil$. The algorithm requires two parts, Steps A and B, when
$N > d$.

Step A consists of Steps A.k, $1 \le k \le r$.

Step A.1: Choose $H_1$ to bisect X, giving components $X_1^{(1)}, X_2^{(1)}$.

Step A.k: Choose $H_k$ to bisect the $2^{k-1}$ current components $X_j^{(k-1)}$.

After Step A, we have $2^r$ components $X_j^{(r)}$.

Step B consists of Steps B.k, $1 \le k \le s$, where s is determined by the termination condition below.

Step B.k: Choose $H_{r+k}$ to bisect the d largest of the current components $X_j^{(r+k-1)}$,
$1 \le j \le 2^r + d(k - 1)$.

After Step B.k, there are at least $2^r + dk$ components. Thus, the algorithm terminates after Step B.s, where
s is the smallest integer satisfying

$$2^r + ds \ge N.$$

This gives a total of

$$U = r + s = r + \lceil (N - 2^r)/d \rceil$$

hyperplanes that separate X. It follows that

$$\lambda_{\min}(X) \le U_{\mathrm{maxmin}}(d, N)$$

for $X \subset R^{(d)}$ and $|X| = N$. This result, together with Theorems 3 and 5, gives the following.

THEOREM 7. The number of hyperplanes required to separate N points in $R^{(d)}$ always lies
between $\lambda_{\mathrm{minmin}}(d, N)$ and $U_{\mathrm{maxmin}}(d, N)$, with $\lambda_{\mathrm{minmin}}$ and $U_{\mathrm{maxmin}}$ as defined in Theorem 3 and
Definition 7.

The lower bound is sharp, and the upper bound cannot be reduced below

$$L_{\mathrm{maxmin}}(d, N) = \lceil (N - 1)/d \rceil.$$

Equivalently,

$$L_{\mathrm{maxmin}}(d, N) \le \lambda_{\mathrm{maxmin}}(d, N) \le U_{\mathrm{maxmin}}(d, N).$$

Table 2 shows $\lambda_{\mathrm{minmin}}(d, 200)$, $L_{\mathrm{maxmin}}(d, 200)$, and $U_{\mathrm{maxmin}}(d, 200)$ for several
values of d.


TABLE 2. Bounds for $\lambda(d, 200)$.

Although $\lambda_{\mathrm{maxmin}}(d, N, \mathcal{N})$ appears to be more complex than $\lambda_{\mathrm{maxmin}}(d, N)$ because of the
additional argument $\mathcal{N}$, it can be bounded using the preceding results together with another example from
the moment curve.

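THEOREM 8. Suppose that $\mathcal{N} = (n_1, n_2, \ldots, n_K)$ is a partition of N, with the parts $n_i$ labeled so
that

$$n_1 \ge n_2 \ge \cdots \ge n_K.$$

Then

$$\lambda_{\mathrm{maxmin}}(d, N, \mathcal{N}) \le U_{\mathrm{maxmin}}(d, N), \qquad (8.1)$$

$$\lambda_{\mathrm{maxmin}}(d, N, \mathcal{N}) \ge L_{\mathrm{maxmin}}(d, N) \quad \text{if } 2n_1 \le N + 1, \qquad (8.2)$$

$$\lambda_{\mathrm{maxmin}}(d, N, \mathcal{N}) \ge \lceil (2N - 2n_1)/d \rceil \quad \text{if } 2n_1 > N + 1, \qquad (8.3)$$

$$\lambda_{\mathrm{minmin}}(d, N, \mathcal{N}) \ge \lambda_{\mathrm{minmin}}(d, K). \qquad (8.4)$$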

PROOF. Any set of hyperplanes that separates $(X, \mathcal{S})$, where $\mathcal{S}$ is the set of singletons from X,
also separates $(X, \mathcal{X})$. Thus, Equation 8.1 follows from Theorem 7. Similarly, Equation 8.4 follows from
the fact that the separating hyperplanes must form at least K distinct regions in $R^{(d)}$.

The lower bounds of Equations 8.2 and 8.3 are obtained from subsets of the moment curve $M^{(d)}$. Let

$$x_i = (t_i, t_i^2, \ldots, t_i^d) \quad \text{for } 1 \le i \le N,$$

where $t_1 < t_2 < \cdots < t_N$. The $x_i$'s are consecutive points in $M^{(d)}$. We now assign the labels 1, 2, ..., K to the
$x_i$'s with frequencies $n_1, n_2, \ldots, n_K$, respectively. Let $b = (b_1, \ldots, b_N)$ be the sequence consisting of
$n_1$ 1's followed by $n_2$ 2's followed by $n_3$ 3's, etc. The sequence c of labels is defined by

$$c = (b_1, b_h, b_2, b_{h+1}, b_3, b_{h+2}, \ldots),$$

where $h = \lfloor (N + 3)/2 \rfloor$. For example, with d = 3, K = 4, N = 15, and $\mathcal{N} = (6, 5, 2, 2)$, we have

b = (1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 4, 4)

and

c = (1, 2, 1, 2, 1, 2, 1, 3, 1, 3, 1, 4, 2, 4, 2).

In this case, $2n_1 = 12 < 16 = N + 1$, all 14 adjacent pairs in c carry different labels, and at least
$5 = \lceil 14/3 \rceil$ hyperplanes are required to separate all pairs with different labels.
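The interleaving used in this proof is easy to check by machine. The Python sketch below builds c from the frequencies and counts the adjacent positions where the label changes; the function names are ours. The final division by d assumes, as in the moment-curve argument, that a single hyperplane meets the curve in at most d points.

```python
def interleaved_labels(freqs):
    """Build c = (b_1, b_h, b_2, b_{h+1}, ...) with h = floor((N + 3) / 2)."""
    b = [label for label, count in enumerate(freqs, start=1) for _ in range(count)]
    n = len(b)
    h = (n + 3) // 2
    c = []
    for i in range(1, h):              # 1-based index into the first part of b
        c.append(b[i - 1])
        if h + i - 1 <= n:             # partner b_{h+i-1} from the second part, if any
            c.append(b[h + i - 2])
    return c


def label_changes(seq):
    """Number of adjacent positions whose labels differ."""
    return sum(1 for x, y in zip(seq, seq[1:]) if x != y)


c = interleaved_labels([6, 5, 2, 2])
cuts = label_changes(c)
print(c)                       # [1, 2, 1, 2, 1, 2, 1, 3, 1, 3, 1, 4, 2, 4, 2]
print(cuts, -(-cuts // 3))     # 14 5
```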


If $2n_1 \ge N + 1$, then $n_1 > N - n_1$. Thus, the number of 1's exceeds the number of remaining
symbols. Placing the remaining symbols in separate intervals between 1's produces a sequence requiring
two cuts for each symbol different from 1. The number of cuts is $2(N - n_1)$, which proves Equation 8.3.
As an example, suppose K = 4, N = 15, and $\mathcal{N} = (10, 2, 2, 1)$. There are nine intervals separating the ten
1's, and $126 = \binom{9}{5}$ ways of placing the remaining symbols in five separate intervals. Each of the resulting
sequences requires 10 cuts. One such sequence is

c = (1, 1, 1, 2, 1, 2, 1, 3, 1, 3, 1, 4, 1, 1, 1).
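
The two counts in this example can be confirmed with a few lines of Python. This is only an arithmetic check of the stated sequence; the variable names are ours.

```python
from math import comb

seq = [1, 1, 1, 2, 1, 2, 1, 3, 1, 3, 1, 4, 1, 1, 1]

# Each position where the label changes needs one cut of the curve.
cuts = sum(1 for x, y in zip(seq, seq[1:]) if x != y)

print(cuts)         # 10, i.e. 2 * (15 - 10)
print(comb(9, 5))   # 126 choices of five of the nine intervals
```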

REMARKS. For most pattern recognition problems the training set $(X, \mathcal{X})$ satisfies $2n_1 \le N + 1$.
We have treated the case $2n_1 > N + 1$ for the sake of mathematical completeness. The extreme of the latter
case occurs with K = 2 and $n_2 = 1$. Here, $\lambda_{\min}(X, \mathcal{X}) = 1$ or 2. Let y be the single member of $X_2$. If y is
an extreme point of Hull(X), then $\lambda_{\min}(X, \mathcal{X}) = 1$. Otherwise, y can be separated from $X_1$ by a pair of
parallel hyperplanes. This is accomplished by first choosing any hyperplane H through y that is disjoint
from $X_1$. The two hyperplanes are then chosen parallel to H and on opposite sides of H, sufficiently close
together so that no point of $X_1$ lies between them. This construction generalizes in the following way, and
yields an improvement on Equation 8.1. Suppose $Y \subset X_j$ and $|Y| \le d$. There exists a hyperplane H
(which is unique if $|Y| = d$) which contains Y and is disjoint from $X - Y$. Since $X - Y$ is finite, one can
select two hyperplanes parallel to H and on opposite sides of H with no member of $X - Y$ lying between
them. Repeating this construction on d-subsets of $X_2, X_3, \ldots, X_K$, one obtains a family $\mathcal{H}$ of $\lambda$ hyperplanes
that separate $(X, \mathcal{X})$, where

$$\lambda \le 2\left(\lceil n_2/d \rceil + \lceil n_3/d \rceil + \cdots + \lceil n_K/d \rceil\right).$$

This provides a better upper bound for $\lambda_{\mathrm{maxmin}}(d, N, \mathcal{N})$ when $n_1$ is sufficiently large.

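EXAMPLE 8. Suppose d = 3, K = 4, N = 50, and $\mathcal{N} = (32, 6, 6, 6)$. Then the construction
above shows that

$$\lambda_{\mathrm{maxmin}}(3, 50, \mathcal{N}) \le 2\left(\lceil 6/3 \rceil + \lceil 6/3 \rceil + \lceil 6/3 \rceil\right) = 12,$$

whereas the upper bound provided by Theorem 8 is 18.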
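
The bound from the parallel-pair construction is a one-line computation. In the sketch below the function name `parallel_pair_bound` is hypothetical, and the formula is taken to be twice the sum of $\lceil n_i/d \rceil$ over every class except the largest, which is the reading that reproduces the value 12.

```python
def parallel_pair_bound(parts, d):
    """2 * sum(ceil(n_i / d)) over all parts except the largest one."""
    rest = sorted(parts, reverse=True)[1:]       # drop n_1, the largest class
    return 2 * sum(-(-n // d) for n in rest)     # -(-n // d) is ceil(n / d)


print(parallel_pair_bound([32, 6, 6, 6], 3))  # 12
```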


SEPARATION  OF  CONVEX  SETS 

We assume in this section that the classes to be separated occupy disjoint regions of the input space
$R^{(d)}$. The task now becomes separation of regions rather than finite sets of points. This approach is useful
in conjunction with certain data analysis techniques. Both cluster analysis and density estimation yield
regions of the input space that are associated with single classes. Separating these regions at the first layer
of an LNN becomes a requirement of the network mapping. Linear separability of finite sets generalizes to
linear separability of subsets.

DEFINITION 8. Two subsets $C_1, C_2$ of $R^{(d)}$ are linearly separable provided there exists a
hyperplane H that cuts every segment joining a point in $C_1$ to a point in $C_2$; equivalently, $C_1$ and $C_2$ lie in
different components of $R^{(d)} - H$.

DEFINITION 9. A set $\mathcal{H}$ of hyperplanes in $R^{(d)}$ separates a family $\mathcal{C}$ of disjoint subsets of $R^{(d)}$
provided that every segment which joins two points lying in different members of $\mathcal{C}$ is cut by at least one
member of $\mathcal{H}$.

FACT 6. Two finite subsets of $R^{(d)}$ are linearly separable if and only if their convex hulls are
disjoint.

FACT 7. Two compact convex subsets of $R^{(d)}$ are linearly separable if and only if they are
disjoint.

The  preceding  definitions  and  facts  are  used  implicitly  in  the  discussion  below. 

Clustering  within  classes  generates  partitions  of  each  class  into  disjoint  subsets.  In  general,  there  is 
no  guarantee  that  the  regions  for  Class  i  are  disjoint  from  those  of  Class  j.  Successive  refinements  of  the 
clusterings  will,  however,  yield  disjoint  sets. 

EXAMPLE 6. For a K-class problem, we have training data $(X, \mathcal{X})$ where

$$\mathcal{X} = \{X_1, X_2, \ldots, X_K\}$$

and

$$X = X_1 \cup X_2 \cup \cdots \cup X_K.$$

At step 1, each $X_i$ is clustered to obtain a partition $\mathcal{K}(i, 1)$ of $X_i$. Replacing each element of $\mathcal{K}(i, 1)$ with its
convex hull gives a covering $\mathcal{C}(i, 1)$ of $X_i$ with convex sets. At step r, the partition $\mathcal{K}(i, r - 1)$ is refined
(i.e., each of its components is partitioned) to obtain a partition $\mathcal{K}(i, r)$ of $X_i$. Again the convex hulls of
the members of $\mathcal{K}(i, r)$ give a covering $\mathcal{C}(i, r)$ of $X_i$ by convex sets. This procedure is continued through
step s, when all of the members of the total covering

$$\mathcal{C}(s) = \mathcal{C}(1, s) \cup \mathcal{C}(2, s) \cup \cdots \cup \mathcal{C}(K, s)$$

are pairwise disjoint. This is possible, since the partition of each $X_i$ into singletons satisfies the
disjointness requirement.

REMARK. It should be noted that a set $\mathcal{H}$ of hyperplanes that separates every pair of sets in $\mathcal{C}(s)$
must also separate $(X, \mathcal{X})$. Conversely, if we have a set of hyperplanes that separate $(X, \mathcal{X})$, the convex
hulls of the unseparated subsets of X form a covering of X by disjoint convex sets. An obvious question
is: Why cluster the data, rather than proceed directly to a search for a separating set $\mathcal{H}$ of hyperplanes? We
have no precise answer. However, it may well be that a good way to start the search for $\mathcal{H}$ is to look for
clusters.

Regions may also be associated with individual classes when the training data is used to define
density functions. For K density functions $p_1, p_2, \ldots, p_K$, one may assign disjoint regions $R_1, R_2, \ldots, R_K$
to the classes in such a way as to maximize (some function of) the K associated probabilities:

$$P_i = \mathrm{Prob}[x \in R_i \text{ given } x \in \text{Class } i];$$

for such a model, the problem ultimately becomes the separation of the K regions. As with the clustering
model, each $R_i$ may be replaced by a finite family of convex sets that cover $R_i$, so that all of the resulting
sets are pairwise disjoint.

Thus,  we  proceed  to  consider  the  problem  of  separation  of  disjoint  convex  sets. 

DEFINITION 10. For $\mathcal{C}$ a finite family of disjoint compact convex subsets of $R^{(d)}$, we define

$$\gamma_{\min}(\mathcal{C}) = \min\{\gamma : \text{there exists a set } \mathcal{H} \text{ of } \gamma \text{ hyperplanes which separates } \mathcal{C}\}, \qquad (10.1)$$

$$\gamma_{\mathrm{minmin}}(d, N) = \min\{\gamma_{\min}(\mathcal{C})\}, \qquad (10.2)$$

$$\gamma_{\mathrm{maxmin}}(d, N) = \max\{\gamma_{\min}(\mathcal{C})\}, \qquad (10.3)$$

where the minimum in Equation 10.2 and the maximum in Equation 10.3 are taken over all families $\mathcal{C}$ of N
disjoint convex subsets of $R^{(d)}$.

Since a point in $R^{(d)}$ is a convex set, separation of convex sets includes separation of points as a
special case. The following theorem is a consequence of Lemma 4 and Theorem 3.

THEOREM 9.

$$\gamma_{\mathrm{minmin}}(d, N) = \min\{\gamma : \mathrm{Reg}_d(\gamma) \ge N\}.$$

Theorem 9 says, in effect, that the easiest problems for convex sets are just the easy problems for
points. However, the worst case cost of separability is much greater for general convex sets than for finite
sets. Here, cost is measured by the number of hyperplanes required for separation. We employ two
techniques to establish a lower bound for $\gamma_{\mathrm{maxmin}}(d, N)$. The first involves constructing families of convex
sets as the Voronoi regions of finite families of points. The second technique replaces junction points in
the arrangement of Voronoi regions with new convex sets.

LEMMA 7. $\gamma_{\mathrm{maxmin}}(d, N) \le \binom{N}{2}$.

PROOF. Lemma 7 follows from Fact 7. One hyperplane for each pair of disjoint convex sets
suffices for separation.

DEFINITION 11. For $X = \{x_1, x_2, \ldots, x_N\}$ a finite subset of $R^{(d)}$, we define the Voronoi region
$V_i$ associated with $x_i$ by

$$V_i = \{x \in R^{(d)} : \|x - x_i\| = \min_j \|x - x_j\|\}.$$

That is, $V_i$ is the set of points whose nearest neighbor in X is $x_i$.

FACT 8. The interiors $V_i^{\circ}$ of the N Voronoi regions for $\{x_1, x_2, \ldots, x_N\}$ are disjoint convex sets.

Voronoi regions are intimately related to linear discriminant functions. For $1 \le i \le N$, define the
nearest prototype discriminant function $F_i$ by

$$F_i(x) = \|x_i\|^2 - 2\, x \cdot x_i,$$

and for $1 \le i, j \le N$ let

$$D_{ij}(x) = F_j(x) - F_i(x).$$

Since $\|x - x_i\|^2 = \|x\|^2 + F_i(x)$, $D_{ij}(x)$ is positive at those points that are closer to $x_i$ than to $x_j$.
Letting $H_{ij}^{+}$ be the set of points at which $D_{ij}(x)$ is positive, we have

$$V_i^{\circ} = \bigcap \{H_{ij}^{+} : 1 \le j \le N,\ j \ne i\}.$$

Thus, each $V_i^{\circ}$ is convex, since it is the intersection of half-spaces.
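
The link between the discriminant functions and nearest-neighbor assignment can be illustrated numerically. The sketch below is an illustration under stated assumptions: the function names and the sample coordinates (a triangle and its centroid) are ours, and $D_{ij}$ is taken as $F_j - F_i$, so that a smaller $F_i(x)$ means x is closer to $x_i$.

```python
def F(x, p):
    """Nearest-prototype discriminant ||p||^2 - 2 x.p (smaller means closer)."""
    return sum(pi * pi for pi in p) - 2 * sum(xi * pi for xi, pi in zip(x, p))


def voronoi_index(x, prototypes):
    """Index i of the Voronoi region containing x, found by minimizing F_i(x)."""
    return min(range(len(prototypes)), key=lambda i: F(x, prototypes[i]))


def nearest_index(x, prototypes):
    """Index of the nearest prototype by squared Euclidean distance."""
    return min(range(len(prototypes)),
               key=lambda i: sum((xi - pi) ** 2 for xi, pi in zip(x, prototypes[i])))


# Illustrative prototypes: three triangle vertices and their centroid.
protos = [(0.0, 0.0), (6.0, 0.0), (0.0, 6.0), (2.0, 2.0)]
for x in [(0.5, 0.5), (5.0, 1.0), (1.0, 5.5), (2.5, 2.0), (-3.0, 7.0)]:
    assert voronoi_index(x, protos) == nearest_index(x, protos)
```

Because each $F_i$ is affine in x, every comparison $F_i(x) < F_j(x)$ is a half-space test, which is why the regions are convex.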

EXAMPLE 7. Figure 5 shows the Voronoi regions for a set X of four points in $R^{(2)}$. The points
$x_1$, $x_2$, and $x_3$ are the vertices of a triangle, and $x_4 = (x_1 + x_2 + x_3)/3$ is the centroid of the triangle. For
each pair $x_i, x_j$ of points, there is a one-dimensional boundary separating their regions $V_i^{\circ}$ and $V_j^{\circ}$.
Therefore, separation of these four regions requires six hyperplanes.

The regions of Figure 5 generalize to $R^{(d)}$, $d \ge 3$. Let $x_1, x_2, \ldots, x_{d+1}$ be the vertices of a simplex
in $R^{(d)}$, and let $x_{d+2} = (x_1 + x_2 + \cdots + x_{d+1})/(d + 1)$, the centroid of the simplex. Each of the $\binom{d+2}{2}$ pairs
$V_i^{\circ}, V_j^{\circ}$ of Voronoi regions shares a (d - 1)-dimensional boundary. Since no two of these boundary regions
lie in the same hyperplane, $\binom{d+2}{2}$ hyperplanes are required to separate the d + 2 convex regions.


LEMMA 8. $\gamma_{\mathrm{maxmin}}(d, d + 2) = \binom{d+2}{2}$.


PROOF. The simplex and its centroid are sufficient to show that $\gamma_{\mathrm{maxmin}}(d, d + 2) \ge \binom{d+2}{2}$,
and from Lemma 7 it follows that $\gamma_{\mathrm{maxmin}}(d, d + 2) \le \binom{d+2}{2}$.

Proceeding from the simplex construction, additional open convex regions may be added. The
collective boundary of the d + 2 regions resulting from the simplex and its centroid contains d + 1 junction
points, i.e., points at which d + 1 distinct (d - 1)-dimensional interfaces meet. By adding a new region,
which is an open simplex containing a junction point, one obtains d + 3 regions requiring $\binom{d+2}{2} + d + 1$
hyperplanes. The d + 1 new hyperplanes contain the d + 1 boundaries of the new region. This construction
also adds d + 1 new junction points while removing one. Thus, each additional region increases the number
of junction points by d and the number of required hyperplanes by d + 1. After adding k new regions, there
are dk + d + 1 junction points and d + k + 2 regions requiring a total of

$$\gamma = \binom{d+2}{2} + k(d + 1)$$

hyperplanes for separation. Substituting N for d + k + 2 gives the following theorem.

THEOREM 10. $\gamma_{\mathrm{maxmin}}(d, N) \ge (d + 1)N - \binom{d+2}{2}$
for $N \ge d + 2$.
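
The substitution behind Theorem 10 can be checked numerically. The sketch below assumes the count $\binom{d+2}{2} + k(d+1)$ with N = d + k + 2 and verifies that it equals $(d+1)N - \binom{d+2}{2}$; the function name is ours.

```python
from math import comb


def convex_lower_bound(d, n):
    """Hyperplanes forced by the simplex-plus-junction construction, n >= d + 2."""
    if n < d + 2:
        raise ValueError("the construction starts from d + 2 regions")
    k = n - (d + 2)                      # regions added after the simplex and centroid
    return comb(d + 2, 2) + k * (d + 1)


# The closed form agrees with the step-by-step count.
for d in range(2, 6):
    for n in range(d + 2, d + 10):
        assert convex_lower_bound(d, n) == (d + 1) * n - comb(d + 2, 2)

print(convex_lower_bound(2, 4))    # 6, four regions in the plane
print(convex_lower_bound(2, 10))   # 24
```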

Thus,  the  cost  of  separation  grows  as  N/d  for  points,  and  at  least  as  fast  as  N(d  +  1)  for  convex  sets. 

CONJECTURE. We conjecture that 2d regions in $R^{(d)}$ may require $2d^2 - d$ hyperplanes, one for
each pair of regions. If this is the case, then

$$\gamma_{\mathrm{maxmin}}(d, N) \ge (d + 1)N - 3d$$

for $N \ge 2d$.


SUMMARY 

The  feed-forward  layered  neural  network  is  the  simplest  of  the  neural  computing  devices  proposed  for 
systems  requiring  pattern  classification  capabilities.  The  number  of  first-layer  neurons  imposes  quantifiable 
limits  on  the  amount  of  separation  that  the  network  can  achieve  in  the  pattern  space.  Conversely,  the 
number  and  complexity  of  the  pattern  classes  force  minimal  requirements  on  the  size  of  the  first  layer  of 
neurons. 

This  report  establishes  bounds  on  the  number  of  first-layer  neurons — in  terms  of  input  dimension 
and  number  of  training  patterns — required  in  a  threshold  network.  Although,  in  general,  neurons  with 
continuous  transfer  functions  are  more  versatile  than  threshold  neurons,  the  separability  capabilities  of 
threshold  neurons  provide  a  baseline. 

In order to completely separate N points (in general position) in d-dimensional Euclidean space, a
threshold network may require as few as $\mathrm{Reg}_d^{-1}(N)$ first-layer neurons or as many as N/d. $\mathrm{Reg}_d^{-1}(N)$ is the
minimum value of $\lambda$ for which

$$\binom{\lambda}{0} + \binom{\lambda}{1} + \cdots + \binom{\lambda}{d} \ge N.$$

For example, the lower and upper bounds for 100 points in the plane are 14 and 50, respectively. For 5000
points in five-dimensional space, we obtain bounds of 16 and 1000. One does not expect to encounter
either of these extremes in real applications. The upper bound, which arises from data lying on a
one-dimensional curve in d-dimensional space, is particularly unrealistic. On the other hand, it is unwise to offer
a  target  number  to  cover  all  contingencies,  since  neuron  requirements  will  obviously  depend  upon  the 
application  as  well  as  upon  the  values  of  d  and  N.  What  is  really  required  is  either  a  fundamental 
understanding  of  the  processes  giving  rise  to  the  patterns,  or  exploratory  data  analysis  of  the  training 
samples  to  determine  their  separability  requirements.  Either  of  these  additional  bodies  of  information  will 
often  yield  not  only  the  number  of  separating  hyperplanes  required,  but  the  hyperplanes  themselves.  The 
hyperplanes  in  turn  determine  the  weights  on  the  first  layer  of  connections. 
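
The two numerical examples above can be reproduced with a short Python sketch. The function names are ours. The lower bound searches for the smallest $\lambda$ whose binomial sum reaches N; for the curve-based figure we assume the form $\lceil (N - 1)/d \rceil$, which coincides with N/d in both quoted cases.

```python
from math import comb


def reg(d, lam):
    """Sum of C(lam, i) for i = 0..d, the quantity compared with N."""
    return sum(comb(lam, i) for i in range(d + 1))


def reg_inverse(d, n):
    """Smallest lam with reg(d, lam) >= n."""
    lam = 0
    while reg(d, lam) < n:
        lam += 1
    return lam


def curve_bound(d, n):
    """ceil((N - 1) / d), an assumed form of the moment-curve bound."""
    return -(-(n - 1) // d)


print(reg_inverse(2, 100), curve_bound(2, 100))     # 14 50
print(reg_inverse(5, 5000), curve_bound(5, 5000))   # 16 1000
```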

As one might expect, the worst case separability requirements for finite families of convex sets are
considerably greater than for points. For N disjoint convex subsets of d-dimensional Euclidean space, the
number of hyperplanes required to separate every pair of subsets ranges from $\mathrm{Reg}_d^{-1}(N)$ to $(N^2 - N)/2$. The
lower bound is the same as for points, since points are convex sets. For $N \le d + 2$, the upper bound is
sharp. Moreover, for small (relative to the dimension d) families of sets, the upper bound is not totally
unrealistic. For a family of multivariate distributions, the Voronoi regions of the class means may include
sections from $(N^2 - N)/2$ hyperplanes among their boundaries. This upper bound has been proved here only
for $N \le d + 2$. We conjecture that it applies for $N \le 2d$. In any event, the worst case hyperplane
requirement for convex sets grows at least as fast as (d + 1)N, whereas the analogous growth for points is
only N/d.

Sets  of  training  data  for  supervised  learning  (pattern  recognition)  consist  of  disjoint  unions  of  finite 
subsets  of  d-dimensional  space.  The  distinct  subsets  of  a  training  set  are  samples  from  a  single  class.  The 
separation  task  for  this  problem  is  to  create  convex  subsets  of  the  pattern  space  each  of  which  contains 
points  of  at  most  one  class.  Thus,  separation  of  all  pairs  of  points  is  not  the  objective.  Surprisingly,  in 
the  worst  case,  all  pairs  of  points  must  be  separated  in  order  to  separate  the  classes.  One  would  expect  this 
type  of  situation  to  arise  only  when  the  underlying  classes  are  nearly  identical.  That  is,  this  extreme 
nonseparability  among  the  classes  of  training  samples  indicates  an  unsolvable  pattern  recognition  problem. 
Indeed,  the  number  of  hyperplanes  required  for  class  separation  can  be  used  as  a  criterion  for  solvability  of 
the  problem.  As  the  requirements  increase,  solvability  decreases. 

Of  greater  interest  than  worst  case  neuron  requirements  are  expected  neuron  requirements.  Expected 
requirements,  however,  lead  to  the  same  dilemma  that  pervades  computational  complexity  questions  in 
theoretical  computer  science.  Expectations  are  dependent  upon  assumptions  regarding  the  distributions  of 
the  classes.  Model  2  leads  to  questions  of  this  type.  We  have  treated  the  worst  case  problem  at  some 
length  in  this  report  for  two  reasons.  The  first  is  simply  that  these  results  follow  easily  from  basic 
knowledge  of  convexity  and  combinatorial  geometry.  The  second,  more  important,  reason  is  the  need  to 
construct  a  firm  mathematical  framework  that  exhibits  the  intimate  relationship  between  the  pattern  space 
and  the  role  played  by  the  first  layer  of  neurons  in  the  classification  procedure.  Understanding  that  such 
bounds  exist  is  perhaps  more  helpful  than  knowing  their  exact  integer  values. 

Appendix

DISCRETE  HAM  SANDWICH  THEOREM 

Our  proof  of  the  discrete  Ham  Sandwich  Theorem  uses  the  Borsuk  Antipodal  Mapping  Theorem  and 
several  basic  facts.  The  following  definitions  will  prove  helpful. 

A median cut of a finite N-subset X of $R^{(d)}$ is a hyperplane H satisfying

$$|H^+ \cap X| \le \lfloor N/2 \rfloor$$

and

$$|H^- \cap X| \le \lfloor N/2 \rfloor.$$

The d-dimensional sphere is the boundary of the closed unit ball in $R^{(d+1)}$; i.e.,

$$S^{(d)} = \{a \in R^{(d+1)} : \|a\| = 1\}.$$

For $u \in R^{(N)}$ and $1 \le i \le N$, $u_{(i)}$ denotes the value of the i-th smallest coordinate of u. Note that $u_{(i)}$
cannot always be associated with a unique coordinate of u.

EXAMPLE. If N = 6 and u = (4, -2, 1, 7, 3, 1), then

$u_{(1)} = -2$     $u_{(4)} = 3$

$u_{(2)} = 1$      $u_{(5)} = 4$

$u_{(3)} = 1$      $u_{(6)} = 7$

Here, $u_{(2)}$ is equal to both $u_3$ and $u_6$.

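FACT. For $1 \le i \le N$, the function $w_i^{(N)} : R^{(N)} \to R$, defined by

$$w_i^{(N)}(u) = u_{(i)},$$

is continuous.

We define the median, $\mathrm{Med}^{(N)} : R^{(N)} \to R$, as follows:

$$\mathrm{Med}^{(N)}(u) = \begin{cases} u_{(r)} & \text{for } N = 2r - 1, \\ \tfrac{1}{2}\left(u_{(r)} + u_{(r+1)}\right) & \text{for } N = 2r. \end{cases}$$

FACT. Since $\mathrm{Med}^{(N)} = w_r^{(N)}$ or $\tfrac{1}{2}(w_r^{(N)} + w_{r+1}^{(N)})$, $\mathrm{Med}^{(N)}$ is continuous.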

Now suppose that $X = \{x_1, x_2, \ldots, x_N\}$ is a finite subset of $R^{(d)}$ and $a \in S^{(d-1)}$. Define $f_X(a)$ by

$$f_X(a) = \mathrm{Med}^{(N)}(P(x_1, a), P(x_2, a), \ldots, P(x_N, a)),$$

where P denotes inner product,

$$P(y, a) = y \cdot a.$$

Since $P : R^{(d)} \times S^{(d-1)} \to R$ is continuous, the mapping $f_X : S^{(d-1)} \to R$ is continuous for every finite X in
$R^{(d)}$. The importance of the family of mappings $f_X$ rests upon the following.

FACT. The hyperplane H perpendicular to a and passing through the point $f_X(a)a$ is a median cut
for X.
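
The order statistics, the median, and the median-cut property can be illustrated with a small Python sketch. The function names and the sample point set are ours; the code only checks the stated properties on one example and is not part of the proof.

```python
def order_stat(u, i):
    """u_(i): the i-th smallest coordinate of u (1-based)."""
    return sorted(u)[i - 1]


def median(u):
    n = len(u)
    r = (n + 1) // 2
    if n % 2 == 1:                                           # N = 2r - 1
        return order_stat(u, r)
    return 0.5 * (order_stat(u, r) + order_stat(u, r + 1))   # N = 2r


def f_X(points, a):
    """Median of the inner products x . a over the finite set X."""
    return median([sum(xi * ai for xi, ai in zip(x, a)) for x in points])


def side_counts(points, a):
    """Points strictly on each side of the hyperplane x . a = f_X(a)."""
    t = f_X(points, a)
    proj = [sum(xi * ai for xi, ai in zip(x, a)) for x in points]
    return sum(p > t for p in proj), sum(p < t for p in proj)


u = (4, -2, 1, 7, 3, 1)
print([order_stat(u, i) for i in range(1, 7)])   # [-2, 1, 1, 3, 4, 7]
print(median(u))                                 # 2.0

X = [(0, 0), (1, 3), (2, 1), (4, 4), (5, 2)]     # illustrative points in the plane
a = (0.6, 0.8)                                   # a unit direction
pos, neg = side_counts(X, a)
assert pos <= len(X) // 2 and neg <= len(X) // 2   # median-cut property
assert f_X(X, (-0.6, -0.8)) == -f_X(X, a)          # f_X(-a) = -f_X(a)
```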


Also of importance is the fact that $f_X$ is antipodal, i.e., $f_X(-a) = -f_X(a)$ for all $a \in S^{(d-1)}$. This follows
from the fact that

$$\mathrm{Med}^{(N)}(-u) = -\mathrm{Med}^{(N)}(u)$$

for all $u \in R^{(N)}$.

The  following  topological  theorem  provides  the  fundamental  result  required  to  prove  the  Discrete 
Ham  Sandwich  Theorem. 

BORSUK ANTIPODAL MAPPING THEOREM. If F is a continuous mapping of $S^{(d)}$ to $R^{(d)}$
satisfying $F(-a) = -F(a)$ for all $a \in S^{(d)}$, then 0 lies in the range of F. That is, for some $a \in S^{(d)}$,

$$F(a) = (0, 0, \ldots, 0).$$

THEOREM  6.  Discrete  Ham  Sandwich  Theorem  (Reference  7). 

Suppose we have d sets X(1), X(2), ..., X(d) in $R^{(d)}$, with $|X(i)| = n_i$ and $\bigcup X(i)$ in general
position. Then there exists a hyperplane H such that H bisects every X(i). That is, for $1 \le i \le d$,
$|X(i) \cap H^+|$ and $|X(i) \cap H^-|$ differ by at most 1. For $n_i$ odd the difference is 1, and for $n_i$ even the
difference is 0.

PROOF. For $1 \le i \le d - 1$, define $F_i : S^{(d-1)} \to R$ by

$$F_i(a) = f_{X(i)}(a) - f_{X(d)}(a).$$

Since $F_i$ is the difference of continuous functions from $S^{(d-1)}$ to R, $F_i$ is continuous. Hence, the function F
from $S^{(d-1)}$ to $R^{(d-1)}$ defined by

$$F(a) = (F_1(a), F_2(a), \ldots, F_{d-1}(a))$$

is also continuous. Moreover, $F(-a) = -F(a)$, since $F_i(-a) = -F_i(a)$ for $1 \le i \le d - 1$.

Applying the Borsuk Antipodal Mapping Theorem to F gives us a point $a \in S^{(d-1)}$ for which F(a) = 0.
Thus, for $1 \le i \le d - 1$, $f_{X(i)}(a) = f_{X(d)}(a)$. It follows that all $f_{X(i)}(a) = A$, where A is constant, and the
hyperplane L through Aa perpendicular to a is a median cut for each of the sets X(i). Let $X = \bigcup_{i=1}^{d} X(i)$,
$X^- = X \cap L^-$, $X^0 = X \cap L$, $X^+ = X \cap L^+$. Similarly we let $X^-(i) = X(i) \cap L^-$, $X^0(i) = X(i) \cap L$, and
$X^+(i) = X(i) \cap L^+$, for $1 \le i \le d$.

If $X^0$ is empty, then H = L bisects all X(i). If not, we must rotate L in order to obtain the bisector
H. Suppose $X^0$ is non-empty. Let $n_i^- = |X^-(i)|$, $n_i^0 = |X^0(i)|$, and $n_i^+ = |X^+(i)|$, for $1 \le i \le d$. Since X
is in general position, $X^0$ contains at most d points, and these points lie in general position in L. Thus,
any split of $X^0$ into two subsets may be effected by a (d - 1)-flat in L (the (d - 1)-flats are the hyperplanes of
L). Our task is to select the split of $X^0$ so as to bisect all of the X(i).

Let $m_i^- = \lfloor n_i/2 \rfloor - n_i^-$ and $m_i^+ = \lceil n_i/2 \rceil - n_i^+$. Since L is a median cut for X(i), $m_i^-$ and $m_i^+$ are
non-negative. Moreover, $m_i^- + m_i^+ = n_i - n_i^- - n_i^+ = n_i^0$. Thus, we may partition $X^0$ into $Y^-$, $Y^+$ as follows.
Split $X^0(i)$ into disjoint subsets $Y_i^-$, $Y_i^+$ of cardinalities $m_i^-$ and $m_i^+$ respectively, and let
$Y^- = \bigcup_{i=1}^{d} Y_i^-$, $Y^+ = \bigcup_{i=1}^{d} Y_i^+$. Choose a (d - 1)-flat K in L which splits $X^0$ into $Y^-$ and $Y^+$. Let $\mathcal{H}$ be the
family of hyperplanes in $R^{(d)}$ which contain K. Every member of $\mathcal{H}$, except L, splits $X^0$ into $Y^-$ and $Y^+$.
Since X is finite, there is a neighborhood $\mathcal{N}$ of L in $\mathcal{H}$, all of whose members split $X \setminus X^0$ into $X^-$ and $X^+$.
We choose H to be a member of $\mathcal{N} \setminus \{L\}$. Then H splits X into $X^- \cup Y^-$ and $X^+ \cup Y^+$. It follows that H
bisects every X(i).